Named Entity Extraction from Speech

Authors

  • Francis Kubala
  • Richard Schwartz
  • Rebecca Stone
  • Ralph Weischedel
Abstract

We report results using a hidden Markov model to extract information from broadcast news. IdentiFinder™ was trained on the broadcast news corpus and tested on both the 1996 HUB-4 development test data and the 1997 HUB-4 evaluation test data with respect to the named entity (NE) task: extracting

  • names of locations, persons, and organizations;
  • dates and times;
  • monetary amounts and percentages.

Evaluation is based on automatic word alignment of the speech recognition output (the NIST algorithm) followed by the MUC-6/MUC-7 scorer for NE on text, since MUC scoring assumes identical text in the system output and in the answer key. Additionally, we used the experimental MITRE scoring metric (Burger et al., 1998). The most encouraging result is that a language-independent, trainable information extraction algorithm degraded on speech input at most by the word error rate of the recognizer.

1. MOTIVATING FACTORS

One of the reasons behind this effort is to go beyond speech transcription (e.g., beyond the dictation problem) to address (at least) shallow understanding of speech. As a result of this effort, we believe that evaluating named entity (NE) extraction from speech offers a measure complementary to word error rate (WER) and represents a measure of understanding. The scores for NE from speech seem to track quality of speech recognition proportionally, i.e., NE performance degrades at worst linearly with word error rate.

A second motivation is the fact that NE is the first information extraction task from text showing success, with error rates on newswire less than 10%. The named entity problem has generated much interest, as evidenced by its inclusion as an understanding task to be evaluated in both the Sixth and Seventh Message Understanding Conferences (MUC-6 and MUC-7), in the First and Second Multilingual Entity Task evaluations (MET-1 and MET-2), and as a planned track in the next broadcast news evaluation.
Furthermore, at least one commercial product has emerged: NameTag™ from IsoQuest. NE is defined by a set of annotation guidelines, an evaluation metric, and example data (Chinchor, 1997).

2. THE NAMED ENTITY PROBLEM FOR SPEECH

The named entity task is to identify all named locations, named persons, named organizations, dates, times, monetary amounts, and percentages. Though this sounds clear, enough special cases arise to require lengthy guidelines, e.g., when is The Wall Street Journal an artifact, and when is it an organization? When is White House an organization, and when a location? Are branch offices of a bank an organization? Is a street name a location? Should yesterday and last Tuesday be labeled dates? Is mid-morning a time? For human annotator consistency, guidelines with numerous special cases have been defined for the Seventh Message Understanding Conference, MUC-7 (Chinchor, 1997).

In training data, the boundaries of an expression and its type must be marked via SGML. Various GUIs support manual preparation of training data and reference answers.

Though the problem is relatively easy in mixed case English prose, it cannot be solved by recognizing capitalization alone. Capitalization does indicate proper nouns in English, but the type of the entity (person, organization, location, or none of those) must still be identified. Many proper noun categories are not to be marked, e.g., nationalities, product names, and book titles.

Named entity recognition is a challenge where case does not signal proper nouns, e.g., in Chinese, Japanese, German or non-text modalities (e.g., speech). Since the task was generalized to other languages in the multi-lingual entity task (MET), the task definition is no longer dependent on the use of mixed case in English.

Broadcast news presents significant challenges, as illustrated in Table 1. Not having mixed case removes information useful to recognizing names in English.
Automatically transcribed speech, even with no recognition errors, is harder still, because SNOR (Speech Normalized Orthographic Representation) format lacks punctuation, spells numbers out as words, and is entirely upper case.

3. OVERVIEW OF HMM IN IDENTIFINDER™

A full description of our HMM for named entity extraction appears in Bikel et al., 1997. By definition of the task, only a single label can be assigned to a word in context. Therefore, to every word, the HMM will assign either one of the desired classes (e.g., person, organization, etc.) or the label NOT-A-NAME (to represent “none of the desired classes”). We organize the states into regions, one region for each desired class plus one for NOT-A-NAME. See Figure 1. The HMM will have a model of each desired class and of the other text. The implementation is not confined to the seven classes of NE; in fact, it determines the set of classes by the SGML labels in the training data. Additionally, there are two special states, the START-OF-SENTENCE and END-OF-SENTENCE states.

Within each of the regions, we use a statistical bigram language model, and emit exactly one word upon entering each state. Therefore, the number of states in each of the name-class regions is equal to the vocabulary size, V. The generation of words and name-classes proceeds in the following steps:

  1. Select a name-class NC, conditioning on the previous name-class and the previous word.
  2. Generate the first word inside that name-class, conditioning on the current and previous name-classes.
  3. Generate all subsequent words inside the current name-class, where each subsequent word is conditioned on its immediate predecessor.
  4. If not at the end of a sentence, go to 1.

Using the Viterbi algorithm, we search the entire space of all possible name-class assignments, maximizing Pr(W, NC). This model allows each type of “name” to have its own language, with separate bigram probabilities for generating its words.
This reflects our intuition that

  • There is generally predictive internal evidence regarding the class of a desired entity. Consider the following evidence: organization names tend to be stereotypical for airlines, utilities, law firms, insurance companies, other corporations, and government organizations. Organizations tend to select names to suggest the purpose or type of the organization. For person names, first person names are stereotypical in many cultures; in Chinese, family names are stereotypical. In Chinese and Japanese, special characters are used to transliterate foreign names. Monetary amounts typically include a unit term, e.g., Taiwan dollars, yen, German marks, etc.
  • Local evidence often suggests the boundaries and class of one of the desired expressions. Titles signal beginnings of person names. Closed class words, such as determiners, pronouns, and prepositions often signal a boundary. Corporate designators (Inc., Ltd., Corp., etc.) often end a corporation name.

While the number of word-states within each name-class is equal to V, this “interior” bigram language model is ergodic,

Mixed Case
The crash was the second of a 757 in less than two months. On Dec. 20, an American Airlines jet crashed in the mountains near Cali, Colombia, killing 160 of the 164 people on board. The cause of that crash is still under investigation.

UPPER CASE
THE CRASH WAS THE SECOND OF A 757 IN LESS THAN TWO MONTHS. ON DEC. 20, AN AMERICAN AIRLINES JET CRASHED IN THE MOUNTAINS NEAR CALI, COLOMBIA, KILLING 160 OF THE 164 PEOPLE ON BOARD. THE CAUSE OF THAT CRASH IS STILL UNDER INVESTIGATION.

SNOR
THE CRASH WAS THE SECOND OF A SEVEN FIFTY SEVEN IN LESS THAN TWO MONTHS ON DECEMBER TWENTY AN AMERICAN AIRLINES JET CRASHED IN THE MOUNTAINS NEAR CALI COLOMBIA KILLING ONE HUNDRED SIXTY OF THE ONE HUNDRED SIXTY FOUR PEOPLE ON BOARD THE CAUSE OF THAT CRASH IS STILL UNDER INVESTIGATION

Table 1: Illustration of difficulties presented by speech recognition output (SNOR).
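
The generation steps and Viterbi search described in Section 3 can be illustrated with a small sketch. This is not the IdentiFinder implementation. It is a simplified toy under stated assumptions: the function and table names (viterbi, p_class, p_first, p_next) are hypothetical, the probabilities are hand-set rather than trained, unseen events get an assumed floor probability instead of the smoothing described in Bikel et al., the END-OF-SENTENCE state is omitted, and a class change is taken to mean a change to a different class.

```python
import math

FLOOR = 1e-6  # assumed probability for any event missing from the toy tables
START = "START-OF-SENTENCE"

def logp(table, key):
    """Log-probability of an event, falling back to the assumed floor."""
    return math.log(table.get(key, FLOOR))

def viterbi(words, classes, p_class, p_first, p_next):
    """Return the name-class sequence maximizing Pr(W, NC) in the toy model.

    p_class[(nc, prev_nc, prev_word)]: step 1, select a name-class
    p_first[(word, nc, prev_nc)]:      step 2, first word inside the class
    p_next[(word, prev_word, nc)]:     step 3, later words inside the class
    """
    if not words:
        return []
    # best[c] = (log score, label path) of the best path ending in class c
    best = {
        c: (logp(p_class, (c, START, START)) + logp(p_first, (words[0], c, START)), [c])
        for c in classes
    }
    for i in range(1, len(words)):
        w, prev_w = words[i], words[i - 1]
        new_best = {}
        for c in classes:
            candidates = []
            for p, (score, path) in best.items():
                if p == c:
                    # stay inside the class: bigram on the previous word
                    step = logp(p_next, (w, prev_w, c))
                else:
                    # switch class, then generate the first word of the new class
                    step = logp(p_class, (c, p, prev_w)) + logp(p_first, (w, c, p))
                candidates.append((score + step, path + [c]))
            new_best[c] = max(candidates, key=lambda t: t[0])
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]

# Hand-set toy probabilities (illustrative only, not trained values).
classes = ["ORGANIZATION", "LOCATION", "NOT-A-NAME"]
p_class = {
    ("ORGANIZATION", START, START): 0.2,
    ("NOT-A-NAME", START, START): 0.7,
    ("LOCATION", START, START): 0.1,
    ("NOT-A-NAME", "ORGANIZATION", "AIRLINES"): 0.9,
    ("LOCATION", "NOT-A-NAME", "NEAR"): 0.6,
}
p_first = {
    ("AMERICAN", "ORGANIZATION", START): 0.05,
    ("AMERICAN", "NOT-A-NAME", START): 0.001,
    ("JET", "NOT-A-NAME", "ORGANIZATION"): 0.05,
    ("CALI", "LOCATION", "NOT-A-NAME"): 0.05,
}
p_next = {
    ("AIRLINES", "AMERICAN", "ORGANIZATION"): 0.5,
    ("CRASHED", "JET", "NOT-A-NAME"): 0.1,
    ("NEAR", "CRASHED", "NOT-A-NAME"): 0.1,
}

words = "AMERICAN AIRLINES JET CRASHED NEAR CALI".split()
labels = viterbi(words, classes, p_class, p_first, p_next)
for word, label in zip(words, labels):
    print(f"{word}\t{label}")
```

With these toy numbers the decoder labels AMERICAN AIRLINES as ORGANIZATION and CALI as LOCATION, using only word identity and context, since the SNOR input carries no case or punctuation. Because every word is observed, the search only needs to track the best path per name-class at each position, rather than enumerating all V word-states in each region.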




Publication date: 1998